Information retrieval  IR  is the science of searching for documents  for information within documents and for metadata about documents  as well as that of searching relational databases and the World Wide Web  There is overlap in the usage of the terms data retrieval  document retrieval  information retrieval  and text retrieval  but each also has its own body of literature  theory  praxis and technologies  IR is interdisciplinary  based on computer science  mathematics  library science  information science  information architecture  cognitive psychology  linguistics  statistics and physics 
Automated information retrieval systems are used to reduce what has been called  information overload   Many universities and public libraries use IR systems to provide access to books  journals and other documents  Web search engines are the most visible IR applications 
Contents  hide 
1 History
1 1 Timeline
2 Overview
3 Performance measures
3 1 Precision
3 2 Recall
3 3 Fall-Out
3 4 F-measure
3 5 Average precision of precision and recall
4 Model types
4 1 First dimension  mathematical basis
4 2 Second dimension  properties of the model
5 Major figures
6 Awards in the field
7 See also
8 References
9 External links
 History

	But do you know that  although I have kept the diary  on a phonograph  for months past  it never once struck me how I was going to find any particular part of it in case I wanted to look it upI	
Dr Seward  Bram Stoker s Dracula  1897
The idea of using computers to search for relevant pieces of information was popularized in an article As We May Think by Vannevar Bush in 1945   First implementations of information retrieval systems were introduced in the 1950s and 1960s  By 1990 several different techniques had been shown to perform well on small text corpora  several thousand documents   
In 1992 the US Department of Defense  along with the National Institute of Standards and Technology  NIST   cosponsored the Text Retrieval Conference  TREC  as part of the TIPSTER text program  The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection  This catalyzed research on methods that scale to huge corpora  The introduction of web search engines has boosted the need for very large scale retrieval systems even further 
The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence  where a digital resource ceases to be readable because the physical media  the reader required to read the media  the hardware  or the software that runs on it  is no longer available  The information is initially easier to retrieve than if it were on paper  but is then effectively lost 
 Timeline
Before 1900s
1880s  Herman Hollerith invents the recording of data on a machine readable medium
1890 Hollerith cards  key punches and tabulators used to process the 1890 US Census data 
1940s1950s
late-1940s  The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans 
1945  Vannevar Bush s As We May Think appeared in Atlantic Monthly
1947  Hans Peter Luhn  research engineer at IBM since 1941  began work on a mechanized  punch card based system for searching chemical compounds 
1950s  Growing concern in the US for a  science gap  with the USSR motivated  encouraged funding  and provided a backdrop for mechanized literature searching systems  Allen Kent et al   and the invention of citation indexing  Eugene Garfield  
1950  The term  information retrieval  may have been coined by Calvin Mooers 
1951  Philip Bagley conducted the earliest experiment in computerized document retrieval in a master thesis at MIT  
1955  Allen Kent joined Case Western Reserve University  and eventually becomes associate director of the Center for Documentation and Communications Research  That same year  Kent and colleagues publish a paper in American Documentation describing the precision and recall measures  as well as detailing a proposed  framework  for evaluating an IR system  which includes statistical sampling methods for determining the number of relevant documents not retrieved 
1958  International Conference on Scientific Information Washington DC included consideration of IR systems as a solution to problems identified  See  Proceedings of the International Conference on Scientific Information  1958  National Academy of Sciences  Washington  DC  1959 
1959  Hans Peter Luhn published  Auto-encoding of documents for information retrieval  
1960s 
early-1960s  Gerard Salton began work on IR at Harvard  later moved to Cornell 
1960  Melvin Earl  Bill  Maron and J  L  Kuhns published  On relevance  probabilistic indexing  and information retrieval  in Journal of the ACM 7 3  216244  July 1960 
1962 
Cyril W  Cleverdon published early findings of the Cranfield studies  developing a model for IR system evaluation  See  Cyril W  Cleverdon   Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems   Cranfield Coll  of Aeronautics  Cranfield  England  1962 
Kent published Information Analysis and Retrieval
1963 
Weinberg report  Science  Government and Information  gave a full articulation of the idea of a  crisis of scientific information   The report was named after Dr  Alvin Weinberg 
Joseph Becker and Robert M  Hayes published text on information retrieval  Becker  Joseph; Hayes  Robert Mayo  Information storage and retrieval  tools  elements  theories  New York  Wiley  1963  
1964 
Karen Sprck Jones finished her thesis at Cambridge  Synonymy and Semantic Classification  and continued work on computational linguistics as it applies to IR
The National Bureau of Standards sponsored a symposium titled  Statistical Association Methods for Mechanized Documentation   Several highly significant papers  including G  Salton s first published reference  we believe  to the SMART system 
mid-1960s 
National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System  the first major machine-readable database and batch retrieval system
Project Intrex at MIT
1965  J  C  R  Licklider published Libraries of the Future
1966  Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs
late-1960s  F  W  Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval 
1968 
Gerard Salton published Automatic Information Organization and Retrieval 
J  W  Sammon s RADC Tech report  Some Mathematics of Information Storage and Retrieval     outlined the vector model 
1969  Sammon s  A nonlinear mapping for data structure analysis   IEEE Transactions on Computers  was the first proposal for visualization interface to an IR system 
1970s
early-1970s 
First online systemsNLM s AIM-TWX  MEDLINE; Lockheed s Dialog; SDC s ORBIT
Theodor Nelson promoting concept of hypertext  published Computer Lib/Dream Machines
1971  N  Jardine and C  J  Van Rijsbergen published  The use of hierarchic clustering in information retrieval   which articulated the  cluster hypothesis    Information Storage and Retrieval  7 5   pp  217240  December 1971 
1975  Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model 
A Theory of Indexing  Society for Industrial and Applied Mathematics 
A Theory of Term Importance in Automatic Text Analysis  JASIS v  26 
A Vector Space Model for Automatic Indexing  CACM 18 11 
1978  The First ACM SIGIR conference 
1979  C  J  Van Rijsbergen published Information Retrieval  Butterworths   Heavy emphasis on probabilistic models 
1980s
1980  First international ACM SIGIR conference  joint with British Computer Society IR group in Cambridge
1982  Belkin  Oddy  and Brooks proposed the ASK  Anomalous State of Knowledge  viewpoint for information retrieval  This was an important concept  though their automated analysis tool proved ultimately disappointing 
1983  Salton  and M  McGill  published Introduction to Modern Information Retrieval  McGraw-Hill   with heavy emphasis on vector models 
mid-1980s  Efforts to develop end user versions of commercial IR systems 
19851993  Key papers on and experimental systems for visualization interfaces 
Work by D  B  Crouch  Robert R  Korfhage  M  Chalmers  A  Spoerri and others 
1989  First World Wide Web proposals by Tim Berners-Lee at CERN 
1990s
1992  First TREC conference 
1997  Publication of Korfhage s Information Storage and Retrieval  with emphasis on visualization and multi-reference point systems 
late-1990s  Web search engines implementation of many features formerly found only in experimental IR systems  Search engines become the most common and maybe best instantiation of IR models  research and implementation 
 Overview

An information retrieval process begins when a user enters a query into the system  Queries are formal statements of information needs  for example search strings in web search engines  In information retrieval a query does not uniquely identify a single object in the collection  Instead  several objects may match the query  perhaps with different degrees of relevancy 
An object is an entity which keeps or stores information in a database  User queries are matched to objects stored in the database  Depending on the application the data objects may be  for example  text documents  images or videos  Often the documents themselves are not kept or stored directly in the IR system  but are instead represented in the system by document surrogates 
Most IR systems compute a numeric score on how well each object in the database match the query  and rank the objects according to this value  The top ranking objects are then shown to the user  The process may then be iterated if the user wishes to refine the query 
 Performance measures

Main article  Precision and Recall
Many different measures for evaluating the performance of information retrieval systems have been proposed  The measures require a collection of documents and a query  All common measures described here assume a ground truth notion of relevancy  every document is known to be either relevant or non-relevant to a particular query  In practice queries may be ill-posed and there may be different shades of relevancy 
 Precision
Precision is the fraction of the documents retrieved that are relevant to the user s information need 

In binary classification  precision is analogous to positive predictive value  Precision takes all retrieved documents into account  It can also be evaluated at a given cut-off rank  considering only the topmost results returned by the system  This measure is called precision at n or P@n 
Note that the meaning and usage of  precision  in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology 
 Recall
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved 

In binary classification  recall is called sensitivity  So it can be looked at as the probability that a relevant document is retrieved by the query 
It is trivial to achieve recall of 100% by returning all documents in response to any query  Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also  for example by computing the precision 
 Fall-Out
The proportion of non-relevant documents that are retrieved  out of all non-relevant documents available 

In binary classification  fall-out is closely related to specificity  1 - specificity   It can be looked at as the probability that a non-relevant document is retrieved by the query 
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query 
 F-measure
Main article  F-score
The weighted harmonic mean of precision and recall  the traditional F-measure or balanced F-score is 

This is also known as the F1 measure  because recall and precision are evenly weighted 
The general formula for non-negative real  is 
 
Two other commonly used F measures are the F2 measure  which weights recall twice as much as precision  and the F0 5 measure  which weights precision twice as much as recall 
The F-measure was derived by van Rijsbergen  1979  so that F  measures the effectiveness of retrieval with respect to a user who attaches  times as much importance to recall as precision   It is based on van Rijsbergen s effectiveness measure E = 1 -  1 /  a / P +  1 - a  / R    Their relationship is F = 1 - E where a = 1 /  2 + 1  
 Average precision of precision and recall
The precision and recall are based on the whole list of documents returned by the system  Average precision emphasizes returning more relevant documents earlier  It is average of precisions computed after truncating the list after each of the relevant documents in turn 


where r is the rank  N the number retrieved  rel   a binary function on the relevance of a given rank  and P   precision at a given cut-off rank 
NOTE  number of relevant documents is actually number of relevant documents in the entire COLLECTION  Otherwise this is ambiguous when dealing with a hitlist that isnt the entire collection  See     There is very little written about MAP even though it is a widely used metric  The argument for entire collection is that otherwise you don t get a sense of recall 
 Model types



Categorization of IR-models  translated from German entry  original source Dominik Kuropka  
For the information retrieval to be efficient  the documents are typically transformed into a suitable representation  There are several representations  The picture on the right illustrates the relationship of some common models  In the picture  the models are categorized according to two dimensions  the mathematical basis and the properties of the model 
 First dimension  mathematical basis
Set-theoretic models represent documents as sets of words or phrases  Similarities are usually derived from set-theoretic operations on those sets  Common models are 
Standard Boolean model
Extended Boolean model
Fuzzy retrieval
Algebraic models represent documents and queries usually as vectors  matrices or tuples  The similarity of the query vector and document vector is represented as a scalar value 
Vector space model
Generalized vector space model
Topic-based vector space model  literature       
Extended Boolean model
Enhanced topic-based vector space model  literature       
Latent semantic indexing aka latent semantic analysis
Probabilistic models treat the process of document retrieval as a probabilistic inference  Similarities are computed as probabilities that a document is relevant for a given query  Probabilistic theorems like the Bayes  theorem are often used in these models 
Binary independence retrieval
Probabilistic relevance model  BM25 
Uncertain inference
Language models
Divergence-from-randomness model
Latent Dirichlet allocation
 Second dimension  properties of the model
Models without term-interdependencies treat different terms/words as independent  This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables 
Models with immanent term interdependencies allow a representation of interdependencies between terms  However the degree of the interdependency between two terms is defined by the model itself  It is usually directly or indirectly derived  e g  by dimensional reduction  from the co-occurrence of those terms in the whole set of documents 
Models with transcendent term interdependencies allow a representation of interdependencies between terms  but they do not allege how the interdependency between two terms is defined  They relay an external source for the degree of interdependency between two terms   For example a human or sophisticated algorithms  
 Major figures

Thomas Bayes
Claude E  Shannon
Gerard Salton
Hans Peter Luhn
W  Bruce Croft
Karen Sprck Jones
C  J  van Rijsbergen
Stephen E  Robertson
Martin Porter
 Awards in the field

Tony Kent Strix award
Gerard Salton Award
 See also

Adversarial information retrieval
Areas of IR application
Clustering
Compound term processing
Controlled vocabulary
Cross-language information retrieval
Educational psychology
European Summer School in Information Retrieval
Free text search
Gain
Human Computer Information Retrieval
Information extraction
Information need
Information Retrieval Facility
Information science
Knowledge visualization
Multisearch
Personal information management
Relevance  Information Retrieval 
Relevance feedback
Subject indexing
Search index
Selection-based search
Tf-idf
XML-Retrieval